Data website link: https://www.kaggle.com/spittman1248/cdc-data-nutrition-physical-activity-obesity

Data cleaning comment: The data set consists of 53392 observations and 16 attributes. The columns are filled with different data types such as integers, decimals, and characters. For data cleaning, I experimented using tibble because it is supposedly easier to work with large data. The select() function moves variables to a new variable, filter() function returns rows with matching conditions, and arrange() function sorts a variable in descending order.

library(tibble)
## Warning: package 'tibble' was built under R version 4.1.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.1.3
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(caret)
## Loading required package: lattice
library(class)
library(e1071)

CDC <- read.csv("cdc.csv")
# Basic R functions for data exploration
str(CDC)
## 'data.frame':    53392 obs. of  33 variables:
##  $ YearStart                 : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ YearEnd                   : int  2011 2011 2011 2011 2011 2011 2011 2011 2011 2011 ...
##  $ LocationAbbr              : chr  "AL" "AL" "AL" "AL" ...
##  $ LocationDesc              : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ Datasource                : chr  "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" "Behavioral Risk Factor Surveillance System" ...
##  $ Class                     : chr  "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" ...
##  $ Topic                     : chr  "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" "Obesity / Weight Status" ...
##  $ Question                  : chr  "Percent of adults aged 18 years and older who have obesity" "Percent of adults aged 18 years and older who have obesity" "Percent of adults aged 18 years and older who have obesity" "Percent of adults aged 18 years and older who have obesity" ...
##  $ Data_Value_Unit           : logi  NA NA NA NA NA NA ...
##  $ Data_Value_Type           : chr  "Value" "Value" "Value" "Value" ...
##  $ Data_Value                : num  32 32.3 31.8 33.6 32.8 33.8 26.4 16.3 35.2 35.5 ...
##  $ Data_Value_Alt            : num  32 32.3 31.8 33.6 32.8 33.8 26.4 16.3 35.2 35.5 ...
##  $ Data_Value_Footnote_Symbol: chr  "" "" "" "" ...
##  $ Data_Value_Footnote       : chr  "" "" "" "" ...
##  $ Low_Confidence_Limit      : num  30.5 29.9 30 29.9 30.2 31 23.7 12.6 30.7 31.6 ...
##  $ High_Confidence_Limit     : num  33.5 34.7 33.6 37.6 35.6 36.8 29.3 20.9 40 39.6 ...
##  $ Sample_Size               : int  7304 2581 4723 1153 2402 1925 1812 356 598 865 ...
##  $ Total                     : chr  "Total" "" "" "" ...
##  $ Age.years.                : chr  "" "" "" "" ...
##  $ Education                 : chr  "" "" "" "Less than high school" ...
##  $ Gender                    : chr  "" "Male" "Female" "" ...
##  $ Income                    : chr  "" "" "" "" ...
##  $ Race.Ethnicity            : chr  "" "" "" "" ...
##  $ GeoLocation               : chr  "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" "(32.84057112200048, -86.63186076199969)" ...
##  $ ClassID                   : chr  "OWS" "OWS" "OWS" "OWS" ...
##  $ TopicID                   : chr  "OWS1" "OWS1" "OWS1" "OWS1" ...
##  $ QuestionID                : chr  "Q036" "Q036" "Q036" "Q036" ...
##  $ DataValueTypeID           : chr  "VALUE" "VALUE" "VALUE" "VALUE" ...
##  $ LocationID                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ StratificationCategory1   : chr  "Total" "Gender" "Gender" "Education" ...
##  $ Stratification1           : chr  "Total" "Male" "Female" "Less than high school" ...
##  $ StratificationCategoryId1 : chr  "OVR" "GEN" "GEN" "EDU" ...
##  $ StratificationID1         : chr  "OVERALL" "MALE" "FEMALE" "EDUHS" ...
names(CDC)
##  [1] "YearStart"                  "YearEnd"                   
##  [3] "LocationAbbr"               "LocationDesc"              
##  [5] "Datasource"                 "Class"                     
##  [7] "Topic"                      "Question"                  
##  [9] "Data_Value_Unit"            "Data_Value_Type"           
## [11] "Data_Value"                 "Data_Value_Alt"            
## [13] "Data_Value_Footnote_Symbol" "Data_Value_Footnote"       
## [15] "Low_Confidence_Limit"       "High_Confidence_Limit"     
## [17] "Sample_Size"                "Total"                     
## [19] "Age.years."                 "Education"                 
## [21] "Gender"                     "Income"                    
## [23] "Race.Ethnicity"             "GeoLocation"               
## [25] "ClassID"                    "TopicID"                   
## [27] "QuestionID"                 "DataValueTypeID"           
## [29] "LocationID"                 "StratificationCategory1"   
## [31] "Stratification1"            "StratificationCategoryId1" 
## [33] "StratificationID1"
summary(CDC)
##    YearStart       YearEnd     LocationAbbr       LocationDesc      
##  Min.   :2011   Min.   :2011   Length:53392       Length:53392      
##  1st Qu.:2012   1st Qu.:2012   Class :character   Class :character  
##  Median :2013   Median :2013   Mode  :character   Mode  :character  
##  Mean   :2013   Mean   :2013                                        
##  3rd Qu.:2015   3rd Qu.:2015                                        
##  Max.   :2016   Max.   :2016                                        
##                                                                     
##   Datasource           Class              Topic             Question        
##  Length:53392       Length:53392       Length:53392       Length:53392      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Data_Value_Unit Data_Value_Type      Data_Value    Data_Value_Alt 
##  Mode:logical    Length:53392       Min.   : 0.90   Min.   : 0.90  
##  NA's:53392      Class :character   1st Qu.:24.10   1st Qu.:24.10  
##                  Mode  :character   Median :30.70   Median :30.70  
##                                     Mean   :31.16   Mean   :31.16  
##                                     3rd Qu.:37.00   3rd Qu.:37.00  
##                                     Max.   :77.60   Max.   :77.60  
##                                     NA's   :5046    NA's   :5046   
##  Data_Value_Footnote_Symbol Data_Value_Footnote Low_Confidence_Limit
##  Length:53392               Length:53392        Min.   : 0.30       
##  Class :character           Class :character    1st Qu.:20.00       
##  Mode  :character           Mode  :character    Median :26.45       
##                                                 Mean   :26.89       
##                                                 3rd Qu.:32.90       
##                                                 Max.   :69.50       
##                                                 NA's   :5046        
##  High_Confidence_Limit  Sample_Size        Total            Age.years.       
##  Min.   : 3.00         Min.   :    50   Length:53392       Length:53392      
##  1st Qu.:28.20         1st Qu.:   566   Class :character   Class :character  
##  Median :35.60         Median :  1209   Mode  :character   Mode  :character  
##  Mean   :35.99         Mean   :  3889                                        
##  3rd Qu.:42.20         3rd Qu.:  2519                                        
##  Max.   :87.70         Max.   :476876                                        
##  NA's   :5046          NA's   :5046                                          
##   Education            Gender             Income          Race.Ethnicity    
##  Length:53392       Length:53392       Length:53392       Length:53392      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  GeoLocation          ClassID            TopicID           QuestionID       
##  Length:53392       Length:53392       Length:53392       Length:53392      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  DataValueTypeID      LocationID    StratificationCategory1 Stratification1   
##  Length:53392       Min.   : 1.00   Length:53392            Length:53392      
##  Class :character   1st Qu.:17.00   Class :character        Class :character  
##  Mode  :character   Median :30.00   Mode  :character        Mode  :character  
##                     Mean   :30.28                                             
##                     3rd Qu.:44.00                                             
##                     Max.   :78.00                                             
##                                                                               
##  StratificationCategoryId1 StratificationID1 
##  Length:53392              Length:53392      
##  Class :character          Class :character  
##  Mode  :character          Mode  :character  
##                                              
##                                              
##                                              
## 
head(CDC)
tail(CDC)
CDC2 <- select(CDC, YearEnd, LocationDesc, LocationAbbr, Question, Sample_Size, Data_Value, Low_Confidence_Limit, Education, Gender, Income,`Age.years.`,`Race.Ethnicity`, Stratification1, StratificationCategoryId1, StratificationID1, GeoLocation, Sample_Size)
as_tibble(CDC2)
CDC_overweight <- select(CDC2, YearEnd, LocationDesc, LocationAbbr, Question, Sample_Size, Data_Value, Stratification1)

CDC_overweight <- filter(CDC_overweight, Question == "Percent of adults aged 18 years and older who have an overweight classification", LocationAbbr == "US", Stratification1 == "Total", LocationAbbr != "PR", LocationAbbr != "GU", LocationAbbr !="VI")

CDC_obese <- select(CDC2, YearEnd, LocationDesc, LocationAbbr, Question, Sample_Size, Data_Value, Stratification1)

CDC_obese <- filter(CDC_obese, Question == "Percent of adults aged 18 years and older who have obesity", LocationAbbr == "US", Stratification1 == "Total", LocationAbbr != "PR", LocationAbbr != "GU", LocationAbbr !="VI")

arrange(CDC_overweight, YearEnd)
arrange(CDC_obese, YearEnd)

Graphs comment: According to the graph, overweight rates have decreased until 2014, which infers that Americans are losing weight annually. However, the obesity rates have been increasing throughout the years, so Americans are gaining weight. In other words, people classified as overweight are transitioning to obese.

overweight_plot <- ggplot(data=CDC_overweight, aes(x = YearEnd, y = Data_Value, group=1)) +
geom_line(color="red")+
geom_point()+labs(title = "2011-2016 US Overweight Rates", x = "Year", y = "Overweight Rate(%)")

obese_plot  <- ggplot(data=CDC_obese, aes(x = YearEnd, y = Data_Value, group=1)) +
geom_line(color="blue")+
geom_point()+labs(title = "2011-2016 US Obesity Rates", x = "Year", y = "Obesity Rate(%)")

grid.arrange(arrangeGrob(obese_plot, overweight_plot, ncol = 2))

Linear regression comment: I decided to choose these features because it seemed logical to assume that there would be a positive relationship between adults not consuming vegetables and the obesity rate. As a result, I wanted to affirm that my hypothesis was correct. The linear regression verifies that not eating vegetables could lead to an increase in obesity. Obs_rate = 0.04512 * Veg + 28.34208. As the percentage of not eating vegetables increases, so will the obesity rate. The RSE is 7.143, making the percentage error (7.143/50.81648) * 100 = 14.056. The RMSE states that our test data is off by a 7.128568 obesity rate on average. Everything looks decent except the fact that the p-value is not very small.

Veg <- select(CDC2, Question, Data_Value)
Veg <- filter(Veg, Question=="Percent of adults who report consuming vegetables less than one time daily")
Veg <- select(Veg, Data_Value)

Obs_rate <- select(CDC2, Question, Data_Value)
Obs_rate <- filter(Obs_rate, Question == "Percent of adults aged 18 years and older who have obesity")
Obs_rate <- select(Obs_rate, Data_Value)

Veg <- na.omit(Veg)
Obs_rate <- na.omit(Obs_rate)

Veg <- unlist(Veg)
Obs_rate <- unlist(Obs_rate)

Veg <- Veg[-c(500:3996)]
Obs_rate <- Obs_rate[-c(500:8127)]

beg <- proc.time()
lm1 <- lm(Obs_rate~Veg)
end <- proc.time()
summary(lm1)
## 
## Call:
## lm(formula = Obs_rate ~ Veg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.7611  -3.8328   0.1856   4.7247  24.7887 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 28.34208    1.07104  26.462   <2e-16 ***
## Veg          0.04512    0.04409   1.023    0.307    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.143 on 497 degrees of freedom
## Multiple R-squared:  0.002103,   Adjusted R-squared:  9.5e-05 
## F-statistic: 1.047 on 1 and 497 DF,  p-value: 0.3066
mse1 <- mean(lm1$residuals^2)
mse1
## [1] 50.81648
rmse1 <- sqrt(mse1)
rmse1
## [1] 7.128568
time <- end - beg
time
##    user  system elapsed 
##    0.02    0.00    0.01
plot(Veg, Obs_rate, main = "Scatterplot of Adults Not Consuming Vegatables Daily vs Obesity Rate", xlab = "Adults Not Consuming Vegatables Daily (%)", ylab = "Obesity Rate (%)", col = "black")
abline(lm1, col = "blue")

kNN regression comment: I would think the percentage of people not eating vegetables and not exercising would affect the obesity rate. However, the accuracy is not high even with the most optimal k-value, so we can assume that kNN might not be the right approach here. The time performance is a bit slower than linear regression as well. Another reason is that the data wasn’t correctly scaled. A solution to find the best k is to use a for loop and traverse the kNN function within it. I manually changed the k value because the range for my k is not substantial.

Phy <- select(CDC2, Question, Data_Value)
Phy <- filter(Phy, Question == "Percent of adults who engage in no leisure-time physical activity")
Phy <- select(Phy, Data_Value)
Phy <- na.omit(Phy)

beg <- proc.time()
knn_pred <- knn(Veg, Phy, Obs_rate, k = 3)
end <- proc.time()
knn_pred <- knn_pred[-c(500:8127)]
knn_pred <- as.numeric(knn_pred)
mean(knn_pred == Obs_rate)
## [1] 0
time <- end - beg
time
##    user  system elapsed 
##    0.02    0.00    0.01

SVM regression comment: I made gender a factor related to the year, sample size, data value, and confidence limits. I chose gender to be a factor because it is a binary value where (0 equals female and 1 equals male) so it can provide some valuable data combined with other data elements. The runtime is two milliseconds slower than kNN. Nonetheless, the accuracy is 88.4%, so the results suggest that our model is a good classifier for the data.

CDC3 <- select(CDC, YearEnd, Sample_Size, Data_Value, Low_Confidence_Limit, High_Confidence_Limit, Gender)
CDC3$Gender <- as.factor(CDC3$Gender)

CDC3 <- CDC3[-c(500:nrow(CDC3)), ]
CDC3 <- CDC3[!apply(CDC3 == "", 1, all), ]
CDC3 <- na.omit(CDC3)

set.seed(1234)
i <- sample(1:nrow(CDC3), 0.75*nrow(CDC3), replace=FALSE)
train <- CDC3[i,]
test <- CDC3[-i,]

beg <- proc.time()
svm1 <- svm(Gender~ ., data = train, kernel = "linear", cost = 10, scale = TRUE)
end <- proc.time()
summary(svm1)
## 
## Call:
## svm(formula = Gender ~ ., data = train, kernel = "linear", cost = 10, 
##     scale = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  49
## 
##  ( 16 19 14 )
## 
## 
## Number of Classes:  3 
## 
## Levels: 
##   Female Male
pred <- predict(svm1, newdata = test)
table(pred == test$Gender)
## 
## FALSE  TRUE 
##    13    99
mean(pred == test$Gender)
## [1] 0.8839286
time <- end - beg
time
##    user  system elapsed 
##    0.03    0.00    0.03

Results analysis: The processing order goes from linear regression, kNN, and SVM, where linear regression is the fastest. Linear regression is interpretable and straightforward but sensitive to outliers. kNN does not assume the shape of the data compared to linear regression, but extra steps like scaling the data are necessary to get good results. SVM deals with complex calculations and decision boundaries, so computing results prove to be stressful. Data doesn’t show a correlation between not eating vegetables and not exercising with obesity prevalence. However, there could be some computation error or data faults. This would make sense because the energy obtained from eating foods is not expended. Thus, logically speaking, not eating vegetables and not exercising would correlate to the obesity rate in some way or another.